Locating text in color documents

Authors

  • Charalambos Strouthopoulos
  • Nikos Papamarkos
  • Antonios Atsalakis
  • Christodoulos Chamzas
Abstract

In complex color documents, text, drawings, and graphics appear in millions of different colors. In many cases, text regions are overlaid onto drawings or graphics. This paper proposes a new method to automatically detect and extract text in mixed-type color documents. The method combines an Adaptive Color Reduction (ACR) technique with a Page Layout Analysis (PLA) approach. The ACR technique is used to obtain the optimal number of colors. The image is then split into separate binary images, one for each principal color. The PLA technique is applied independently to each of the color planes and identifies the text regions. In the final stage, a merging procedure combines the text regions derived from the color planes to produce the final document.
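The pipeline outlined in the abstract (reduce colors, split into binary color planes, analyze each plane, merge) can be illustrated with a rough sketch. This is not the authors' implementation: plain k-means stands in for ACR, the number of colors `k` is supplied by the caller rather than found automatically, the PLA step is left out entirely, and merging by logical OR is an assumption. The function names `reduce_colors`, `split_into_planes`, and `merge_text_masks` are hypothetical.

```python
import numpy as np

def assign_labels(pixels, centers):
    # Squared Euclidean distance from every pixel to every color center.
    dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def reduce_colors(image, k, iters=10, seed=0):
    """Quantize an H x W x 3 image with plain k-means.

    Assumption: k-means is only a stand-in for the ACR technique,
    whose details the abstract does not give.
    """
    pixels = image.reshape(-1, 3).astype(np.float64)
    unique = np.unique(pixels, axis=0)
    k = min(k, len(unique))  # cannot have more centers than distinct colors
    rng = np.random.default_rng(seed)
    centers = unique[rng.choice(len(unique), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign_labels(pixels, centers)
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    labels = assign_labels(pixels, centers)
    return labels.reshape(image.shape[:2]), centers

def split_into_planes(labels, n_colors):
    # One boolean image per principal color.
    return [labels == j for j in range(n_colors)]

def merge_text_masks(masks):
    # Assumption: per-plane text masks are combined with a logical OR.
    return np.logical_or.reduce(np.stack(masks), axis=0)
```

In a fuller version, each plane returned by `split_into_planes` would be passed to a layout-analysis routine that returns a text mask for that plane, and those masks would then be given to `merge_text_masks`.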

Similar resources

Locating XML Documents Using Content and Structure Synopses

In this paper, we present a novel framework for locating schema-less XML documents based on concise data synopses extracted from the documents. We introduce two novel data synopses, content synopsis and positional filter, to summarize the text data in an XML document for query evaluation. These two data synopses correlate textual with positional information and consider the containment rela...

The locating-chromatic number for Halin graphs

Let G be a connected graph. Let f be a proper k-coloring of G and Π = (R_1, R_2, ..., R_k) be an ordered partition of V(G) into color classes. For any vertex v of G, define the color code c_Π(v) of v with respect to Π to be a k-tuple (d(v, R_1), d(v, R_2), ..., d(v, R_k)), where d(v, R_i) is the min{d(v, x) | x ∈ R_i}. If distinct vertices have distinct color codes, then we call f a locat...

Connected Component Based Word Spotting on Persian Handwritten image documents

Word spotting is to make searchable unindexed image documents by locating word/words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents to predefined categories based on the documents' contents and labeled training samples. Since word detection is a difficult and time-consuming task in the Persian language, a Bayesian text classifier is an appropriate approach to deal with different...

Pii: S0031-3203(01)00167-4

Text extraction in mixed-type documents is a necessary pre-processing stage for many document applications. In mixed-type color documents, text, drawings and graphics appear with millions of different colors. In many cases, text regions are overlaid onto drawings or graphics. In this paper, a new method to automatically detect and extract text in mixed-type color documents is presented. The ...

Publication date: 2001